Simulation of the matrix Bingham-von Mises- Fisher distribution, 
with applications to multivariate and relational data 



Peter D. Hoff * 
February 5, 2008 



Abstract 

Orthonormal matrices play an important role in reduced-rank matrix approximations and the 
analysis of matrix- valued data. A matrix Bingham-von Mises-Fisher distribution is a probability 
distribution on the set of orthonormal matrices that includes linear and quadratic terms, and 
arises as a posterior distribution in latent factor models for multivariate and relational data. 
This article describes rejection and Gibbs sampling algorithms for sampling from this family of 
distributions, and illustrates their use in the analysis of a protein-protein interaction network. 



Some key words: Bayesian inference, eigenvalue decomposition, Markov chain Monte Carlo 
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1 Introduction 

Normal vectors and orthonormal matrices play an important role in spatial statistics, multivariate 
analysis and matrix decomposition methods. The set of rank- R Rxm orthonormal matrices is called 
the Stiefel manifold and is denoted VR >m . Probability distributions and statistical inference for data 
from this manifold have been developed primarily in the spatial statistics literature, particularly for 
the case of points on a sphere {R = l,m = 3). Theoretical treatments of probability distributions 
on higher dimensional manifolds have been given in |Gupta and Nagar| [2000] and |Chikuse| [2003| . 



Many of the commonly used probability distributions on Vi^ m have exponential family forms. 
For example, a flexible probability density having linear and quadratic terms is given by 



P bmf{X\A, B, C) oc eti{C T X + BX 1 AX), (1) 
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where A and B can be assumed to be symmetric and diagonal matrices, respectively. This density, 
introduced by Khatri and Mardia |1977 , is called the matrix Bingham-von Mises- Fisher density or 
the matrix Langevin-Bingham density. 

Relevant to the modeling of spatial data, Wood [1987 describes a method for simulating from 
densities of this type in cases where R = 1 and m = 3, and Kent et al. |2004 describe methods 
for a complex version of this distribution that is feasible for R = 1 and small values of m. Kume 



and Walker |2006 describe a Gibbs sampling scheme for the case R = 1 and C = 0. However, 



there is a need for statistical and computational tools for data from higher dimensional manifolds: 
Large, matrix-variate datasets are frequently analyzed and described using matrix decomposition 
techniques, in which heterogeneity across rows and columns are represented by low-rank orthonor- 
mal eigenvector matrices. Probability models for these matrices provide a framework for describing 
variability and uncertainty in matrix variate-data. In particular, distributions of the form ([I]) arise 
as posterior distributions in many models for multivariate and relational data: 



Example (Factor analysis): Let Y be an n x p data matrix representing n samples from a 
p-variate distribution. If p is large or the columns are highly correlated, it may be desirable 
to represent yi, the p measurements within row i, as linear functions of R < p latent factors 
Ui = {u iy x, . . . ,u ijR }: 

Vi,i = ujDvj + ( ,. ; 
Y = UDV T + E 

In this parameterization, the nx R and px R matrices U and V can be assumed to be orthonormal 
matrices, elements of Vj? jn and Vr )P respectively. In situations involving ordinal or missing data it 
may be desirable to take a likelihood-based approach to estimation of U, D and V. If the error 
matrix E is made up of independent and identically distributed normal variates, then uniform prior 
distributions for U and V imply that 

p{U\Y,D,V) oc etr([Y T VD] T U/a 2 ) 
p(V\Y,D,U) cx etr([YUD] T V/a 2 ), 

which are matrix von Mises-Fisher distributions. Joint posterior inference for U and V can be 
obtained by iteratively sampling from these two distributions. 

Example (Principal components): Again, let Y be an n x p data matrix where the rows 
are assumed to be independent samples from a mean-zero multivariate normal population with 
covariance matrix S. Writing S via its eigenvalue decomposition S = UAU T , the probability 
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density of the data is 

p(y|£) = (2vr)-' ip / 2 |Sr' 1 / 2 etr(-yS- 1 F T /2) 

= (2vr)-^/ 2 fl XJ^etri-A^U^YU/l). 

3=1 

A uniform or matrix von Mises-Fisher prior distribution on U £ V PiP results in a Bingham-von 
Mises-Fisher posterior distribution. This result could be useful if one wanted to use a non-standard 
prior distribution for the eigenvalues of S, or if there was prior information about the principal 
components. 

Example (Network data): Network data consist of binary measurements on pairs of objects or 
nodes. Such data are often represented as a graph in which a link between two nodes indicates the 
presence of a relationship of some kind. Alternatively, the data can be represented with a binary 
matrix Y so that yij is the 0-1 indicator of a link between nodes i and j (the diagonal of Y is 
generally undefined). One approach to modeling such data is to use a latent factor model with a 
probit link: 

Vi,j = S (c,oo)( z i,j) 
Zij = ufAuj + eij 
Z = UAU T + E, 

where E is modeled as a symmetric matrix of independent standard normal noise, A is a diagonal 
matrix and U is an element of Vr^, with R generally taken to be much smaller than m. Such a 
model can be thought of as a latent eigenvalue decomposition for the graph Y. Given a uniform 
prior distribution for U, we have 

p{U\Z,A) oc etr{Z T UAU T /2) 
= eti{AU T ZU/2), 

which is a Bingham distribution with parameters A = Z/2 and B = A. 

Section 2 of this article describes a rejection sampling method for the matrix von Mises-Fisher 
distribution, and a Gibbs sampling algorithm is provided for cases in which the rejection method 
is infeasible. Section 3 presents a Gibbs sampling algorithm for generating random matrices of 
arbitrary dimension from the Bingham-von Mises Fisher (BMF) distribution. Specifically, I show 
how to construct a Markov chain in A € VR :Jn , samples from which converge in distribution to Pbmf- 
Section 4 implements the sampling algorithms in the context of a data analysis of the interaction 
network of 270 proteins. In this example the ability to sample from the BMF distribution allows 
for Bayesian inference and estimation. A discussion follows in Section 5. 
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2 Sampling from the von Mises-Fisher distribution 



The vector von Mises-Fisher (MF) distribution on the m-dimensional sphere has a density with 
respect to the uniform distribution given by 

|| c /2||m/2-l 

Pmf(x\c) = — — — y—--exp{c T x}, 

r(m/2)/ m/2 _i(||c||) 



where I u {-) is a modified Bessel function of the first kind. Wood [1994 provides a sampling scheme 
for this distribution that is relatively straightforward to implement. In this section we show how 
this ability to sample from the vector-valued density can be used to sample from matrix MF 
distribution, having density 

where oFi(hm; \D 2 ) is a type of hypergeometric function with a matrix argument 



Herz 



1955 



'H2 m > 4 

We first present a rejection sampling approach in which a rejection envelope is constructed out 
of a product of vector-valued densities. A second approach is by iterative Gibbs sampling of the 
columns of X. 

2.1 A rejection sampling scheme 

An important reparameterization of the matrix MF density on Vn. m is via the singular value 
decomposition of C, whereby C = UDV T , with U and V being m x R and R x R orthonormal 
matrices, and D a diagonal matrix with positive entries. Using these parameters, the density of X 
can be written as pmf(X\U, D,V) cx eti(VDU T X) = eti(DU T XV). This density is maximized at 
X = UV T , which can be interpreted as the modal orientation of samples from the population. The 
entries of D can be interpreted as concentration parameters, describing how close the samples are 
to the mode. 

As pointed out in Chikuse |2003 , one way to generate samples from the matrix MF distribution 



is with rejection sampling based on a uniform envelope: Since the density is maximized in X by 
UV T , the ratio between pmf and the uniform density g u on Vjj, m is bounded by 

Pmf{X) etr (DU T [U T V]V) _ etr(D) 



9u(X) Fi(±m; \D*) F 1 (\m;lD^ 

If independent pairs X ~ g u and u ~ uniform(0, 1) are repeatedly sampled until u < etr(C T X — D), 
then the resulting X has density pmf- As Chikuse points out, such a procedure will be extremely 
inefficient for most C of interest, due to the poor approximation of pmf by g u . 

A much better approximation to pmf can be constructed from a product of vector von Mises- 
Fisher densities: If H is an orthogonal matrix, then samples X from MF(iJ) will be such that each 
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column vector X^ is close to the direction of . This suggests generating proposal samples 
from a von Mises-Fisher density with concentration vector H^ r j , although constrained to be orthog- 
onal to the other columns of X. With this in mind, consider the density on Vr^ corresponding to 
the random matrix X sampled as follows: 

1. sample X [;1] ~ MF(i7[ ;1 ]). 

2. for r £ {2,...,R}: 

(a) construct N r , an orthonormal basis for the null space of ^(^...r-i)]; 

(b) sample z ~ MF(N?H lr] ); 

(c) set X^ r j = N r z. 

Straightforward calculations show that the probability density of the matrix X generated as above 
can be expressed as 

f* ||ArJtf M /2||( m — ^ 1 r T i 

S(X) " ii r(^)w_ 1)/2( || W ) } 

with respect to the uniform density on Vj? jm . The ratio of the matrix MF density to g is then 



-r+1 /( m - r -i)/2(||JVr fl"[,r]ll) 



(m-r-l)/2 



Since Ik(x)/x k is an increasing function of x, and 1 1 N r H^ r j \ \ < \ \H^ r ]\\, we have 

PM ^ X) < oF^-m; [fx^-r-^ m - r + 1 / ( ^^(llffj ]||) ) = 

u v 2 '4 7 UjL v 2 7 || J H M ||( m - r - 1 )/ 2 J V ' 

For any particular X, the ratio of pMp(X)/g(X) to this upper bound is 

Pmf(X) A I {m - r - 1)/2 (\\N?H r \\) ( \\ Hr \\ \ (— D/ 2 
g(X)K(H) 1 = 1 / (m _ r _ 1)/2 (||^||) \\\N?H r \\) 

We use this bound and ratio in the rejection sampler described below. If H is not orthogonal, then 
although the bound is sharp the ratio may be extremely small: Consider the case in which and 
H^ 2 ] are both equal to the vector a. In this case, samples of X from pmf will have one or the other 
of its first two columns close to the direction of a with equal probability, whereas samples from g 
will generally have X^ close to the direction of a and X^ 2 ] orthogonal to it. In this case, WNj H^ 2 ] \ \ 
will be much smaller than ||ff[ j2 ]||, making the ratio quite small. The remedy to this problem is 
quite simple: To sample from the matrix MF distribution with a non-orthogonal concentration 
matrix C and singular value decomposition C = UDV T , first sample a matrix Y ~ MF(H) with H 
being the orthogonal matrix UD, using the above described rejection scheme, then set X = YV T . 
This procedure is summarized as follows: 
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m=10 



m=20 



m=200 
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R = 2 


R = A 


R = 6 
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R = 2 


R = A 


R = 6 
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R = 2 


R = A 


R = 6 


5 


0.07 


0.76 


5.08 


10 


0.11 


0.63 


3.80 


100 


0.03 


0.68 


2.28 


10 


0.09 


2.40 


26.52 


20 


0.24 


1.78 


15.25 


200 


0.20 


1.45 


10.04 


20 


0.38 


4.01 


77.74 


40 


0.30 


3.65 


42.70 


400 


0.35 


3.40 


36.52 



Table 1: Average number of rejected samples needed to generate a single sample from MF(C), for 
various values of m, R and singular values d. 

1. Obtain the singular value decomposition UDV T of C and let H = UD; 

2. Sample pairs {u, Y} until u < ^y)k[h) ' usrn S the following scheme: 

(a) sample u ~uniform(0,l) 

(b) sample Yni ~ MF(fTMi), and for r € {2, ... , R} consecutively, 

i. construct N r = NuU(Y"r m .„ r _i)]) 

ii. sample z ~ MF(N^H [>r] ) 

iii. set Yf r i = iV r 2 

3. Set X = y^ T 

To examine the feasibility of this rejection scheme a small simulation study was performed for 
m G {10, 20, 200} and £ {2, 4, 6}. For each combination of m and R, three m x R matrices were 
constructed, each having R singular values all equal to m/2,m or 2m. One-hundred samples from 
each MF distribution were generated using the above rejection scheme, and the average number of 
rejected samples are recorded in Table [TJ In general, the results indicate that the rejection sampler 
is a feasible method for sampling from the MF distribution for a broad range of values of m, R 
and D. However, we see that the average number of rejected samples is generally increasing in the 
magnitude of the elements of D and the ratio R/m. For large values of this latter ratio, ||iV^-ffr r i|| 
is typically a small fraction of | \H\ r ] \ \ and the ratio pmf(X) / g{X) is then rarely close to the bound. 
Similarly, large D leads to large differences between 1 1 Nj \ \ and 1 1 H\ r i \ \ for the higher values of 
r. 



2.2 A Gibbs sampling scheme 

For large values of D or R the rejection sampler described above may be prohibitively slow. As an 
alternative, a simple Gibbs sampling scheme can be used to construct a dependent Markov chain 
{X^\ X^ 2 \ . . .} which converges in distribution to pmf- Such a sequence can be constructed by 
iteratively sampling the columns of X from their full conditional distributions. 
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The density of X ~ MF(C) can be expressed in terms of a product over the columns of X: 



Pmf(X\C) oc etr(C T X) 
R 

oc JJexp(Cj r r] X w ) 

r=l 

The columns are not statistically independent of course, because they are orthogonal with proba- 
bility one. As such, we can rewrite X as X = {Xni, Xr ,-11} = {Nz,Xt—\]}, where z £ <S TO _i and 
N is an m x (m — 1) orthonormal basis for the null space of Xia. Note that N T N = I and so 
N T X^. Following Chapter 3 of Chikuse 2003 , the conditional distribution of z given Xr ji 



z 

is given by 



p(z\X[_ 1 j) oc exp(C^Nz) = exp(c J z) 

which is a vector MF density. This fact can be exploited to implement a Gibbs sampler that gener- 
ates a Markov chain in X. Given a current value X^ = X, the sampler proceeds by construction 
of a new value X^ s+1 ^ as follows: 

Given X^ = X, perform steps 1, 2 and 3 for each r £ {1, . . . , R} in random order: 

1. let N be the null space of Xt r i and let z = N T X[ r y, 



Wood 



1994 



2. sample z ~ MF(iV T C[ ir ]) using the sampling scheme of 

3. set Xr r ] = Nz. 

Set X( s+1 ) = X. 

Iteration of this algorithm generates a reversible aperiodic Markov chain {X^ , X^ ,...}. If m > R 
this Markov chain is also irreducible, and samples from the chain converge in distribution to MF(C). 
However, if m = R then the chain is reducible. This is because in this case the null space of Xr r i 
is one dimensional and therefore z £ {—1,1}, meaning that the samples from the Markov chain 
will remain fixed up to column- wise multiplication of ±1. The remedy for this situation is not too 
difficult: An irreducible Markov chain for the case m = R can be constructed by sampling two 
columns of X at a time. Details of such a procedure are given in the context of the more general 
Bingham-von Mises Fisher distribution in the next section. 

Finally, we note that non-orthogonality among the columns of C can add to the autocorrelation 
in the Gibbs sampler. This source of undesirable autocorrelation can be removed by performing 
the Gibbs sampler onF~ MF(H), where X = UDV T = YV T and H = UD, as described in the 
previous subsection. 
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3 Sampling from the Bingham-von Mises-Fisher distribution 



In this section a Markov chain Monte Carlo method for sampling from pbmf is derived. Similar 
to the approach in Section 2.2, the method involves iteratively resampling each column of X given 
the others using full conditional distributions, thereby generating a Markov chain {X^\ X^ 2 \ . . .} 
whose stationary distribution converges to Pbmf- We first present a method for sampling from the 
vector BMF distribution on x £ S m , and then show how this can be used iteratively to sample from 
the matrix valued extension. 

3.1 The vector Bingham distribution 

The Bingham distribution on the m-dimensional sphere has a density with respect to the uniform 
distribution given by 

Pb(x\A) ex exp(x T Ax), x G S m . 

Since x T Ax = x T A T x = ^x T (A + A T )x , A can be assumed to be symmetric. Let the eigenvalue 
decomposition of A be A = EAE T ', and let y = E T x. Since E is orthonormal the change of 
variables formula gives the density of y as 

p(y\E,A) = c(A)exp(y T E T EAE T Ey) 
= c(A) exp(y T Ay) 

m 

oc exp(^Ai^ 2 ). 

i=\ 

Again, this density is with respect to the uniform density on the sphere. If we write = 
1 — Ya^i the uniform density in terms of the unconstrained coordinates {yi, . . . , y m -\) is 
proportional to |y m | _1 = (1 — Y^I=i^ Ui)' 1 ^ 2 an d so the above density with respect to Lebesgue 
measure on y±, . . . , y m ~\ becomes 

m m— 1 

p(y\E,A) ocexp^A^lyJ- 1 , Vm = 1 - J^y?€[0,l]. 

1=1 8=1 

We now consider Gibbs sampling of the vector y. One possibility would be to sample components 



of y from their full conditional distributions. This is the approach taken by Kume and Walker |2006 
While straightforward, such an approach can result in a slowly mixing Markov chain because the 
full conditionals are so highly constrained. For example, the full conditional distribution of y\ 
given y2, ■ ■ ■ ,y m -i is non-zero only on y\ < 1 — Y^2~ l Vi- ^ s an alternative, let 8 = y\ and 
Q = {Vi/i 1 ~ Vi), ■ ■ -iVm/i. 1 ~ Vi)}, so that {yj, . . . ,y^} = {9,(1 - Q)q-\}. Sampling a new value 
of y\ = 9 £ (0, 1) conditional on g_i allows for larger redistributions of the "mass" of {yf, . . . , y^} 
while ensuring that Yyi uf = 1 • 
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Keeping in mind that y^ = 1~Y1T 1 Vi an< ^ 1m = 1 — Y^2 1 the joint density of {0, q2, ■ ■ ■ , q m -i} 
can be obtained from that of {yi, . . . , y m -i} as follows: 

= 2| yi | = 2flV2 , A = 2 M = 2 i/2 (1 _ ^-1,2 f>1 
*/i % 1 - 



and so 



d{yi, . . • ,y m -ij 



= exp(0Ai + (1 - e)q T _ 1 \- l ) x rV^l - 0)( m - 3 )/ 2 []^ 
p(0|g_i) oc exp(0[Ai - gljA-i]) x 9~ 1 I 2 {1 - 9) { - m '^l 2 . 



m 

1/2 

i 

2 



Sampling # G (0, 1) can proceed by computing the relative probabilities on a grid or with a rejection 
scheme as described at the end of Section 3.2. Iteration of this procedure with 8 = y 2 for each 
i G {1, . . . , m} and a similarly redefined g generates a Markov chain in {yf, . . . , y 2 ^} with a stationary 
distribution equal to p(y 2 , ■ ■ ■ , y^J-E, A). The signs of the y^'s do not affect the density and can 
each be randomly and independently assigned to be positive or negative with equal probability. 
The value of x is then obtained from x = Ey. To summarize, Markov chain Monte Carlo samples 
from ps{x\A) can be obtained by iterating the following algorithm: 

Given A = E T AE and a current value of x^ = x, 

1. compute y = E x; 

2. perform steps (a)-(d) for each i £ {1, . . . , m} in random order: 

(a) let {qi,...,q m } = {yf/(l - y 2 ), ■ ■ ■ , y 2 J(l - y 2 )}; 

(b) sampled G (0,1) from the density proportional to e e (*i-9-i*-i) x 9~ 1 / 2 (1 - (9)( m -3)/ 2 ; 

(c) sample Sj uniformly from { — 1, +1}; 

(d) set ?/j = SiO 1 / 2 and for each j / i, set y 2 = (1 — 9)qj leaving the sign unchanged. 

3. set x = Ey. 

Set x( s+1 ) = x. 

3.2 The vector Bingham-von Mises-Fisher distribution 

The Bingham-von Mises-Fisher (BMF) density adds a linear term to the quadratic of the Bingham 
density, so that p(a;|.A) oc exp(c T x + x 1 Ax). A Gibbs sampling algorithm for this distribution can 
proceed in nearly the same way as for the Bingham distribution. For the BMF distribution, the 
signs of the y^s are not uniformly distributed on { — 1, +1}, and so we parameterize y in terms of 9 
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and q as above but additionally let Sj = sign(yj). For a given value of the vector s the transformation 
between (9, q-±) and y is one-to-one, and the conditional density of {9, si} given q_\ and s_i is 

p(e,ai|g_i,«_i) OC je^Ai-^iA-i) x 0-l/2(! _ 0)(m-3)/2 j 

x exp(^V2 Sldl + (1 _ #)i/2 (s i g V 2 )r d _ 1 ) j 

where <i = i? T c. A value of {9,s\} can be sampled from its full conditional distribution by first 
sampling 9 £ (0, 1) from p(9\q-i, s-i) and then sampling s± conditional on 9. This results in the 
following modification steps 2(b) and 2(c) above: 

(b) sample 9 from the density proportional to p(9,Si = — l|g_j,s_j) + p(9,Si = +l|<7-j, s_j); 

(c) sample Sj G {— 1,+1} with probabilities proportional to {e~ el/2(li , e +el/2(li }. 

Again, the simplest way to sample 9 is by approximating its distribution on a grid of (0, 1). Alter- 
natively, it is not too hard to come up with efficient rejection sampling schemes for various values 
of the parameters. The density we need to sample from is of the form 

p{9) oc 0-V2(i _ d) k e ea+(i-e)^b x (e -*i/* c + e *va C) _ 

If k is larger than a, b or c then a beta(l/2,fc) distribution works well as a proposal distribution. On 
the other hand, if a is much larger than the other parameters then there will be a local mode close 
to 1. I have found that a beta(l/2, 1 + k A [(k — a) V —1/2]) proposal distribution works well for a 
wide range of scenarios where either a or k dominate the density (as they do for the data analysis 
in Section 4). Further details of a rejection sampling scheme based on this proposal distribution 
for 6 are available in the R-software companion to this article. 

3.3 The matrix Bingham-von Mises-Fisher distribution 

Expressing the density of X ~ BMF(A, B,C) in terms of a product over the columns of A, we have 

PB mf(X\A,B,C) oc eti(C T X + BX T AX) 

R 

oc H exp(Cf r] X [jr] + b r , r X^ r] AX lr] ) 

r=l 

As in Section 2, the columns are not statistically independent because they are orthogonal with 
probability one. We rewrite X as X = {Nz,Xi_i\} as before, where z G 5 m _i and N is an 
m x (m — 1) orthonormal basis for the null space of -X^-i] . The conditional density of z given X^_^ 
is 

p(z\X { _q) oc exp(C^j]Az + b hl z T N T ANz) = exp(c T z + z T Az) 

which is a vector BMF density. A Markov chain in X that converges to BMF(^4, B, C) can therefore 
be constructed as follows: 
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Given = X, perform steps 1-4 for each r G {1, . . . , R} in random order: 

1. let N be the null space of X^_ r j an d let z = N T X^ r y, 

2. compute c = N T C[^ and A = b r;r N T AN; 

3. update the elements of z using the Gibbs sampler for the vector BMF( J 4, c) density; 

4. set X[ >r ] = Nz. 

Set X( s+1 ) = X. 

Iteration of this algorithm generates a reversible aperiodic Markov chain {X^\ X^ 2 \ . . .} which is 
irreducible for m > R and thus converges in distribution to BMF(A, B, C). However, if m = R then 
the chain is reducible. As described in Section 2.2, this is because in this case the null space of X\_ r i 
is one dimensional and the samples from the Markov chain will remain fixed up to column-wise 
multiplication of ±1. The remedy for this situation is to sample multiple columns of X at a time. 
In particular, the full conditional distribution of two columns is easy to derive. For example, the 
null space N of -X"[ ; _( 1)2 )] is an m x 2 matrix, and so -X^^)] = NZ where Z is a 2 x 2 orthonormal 
matrix. The density of Z given X\_n 2 )] is 

p(Z) oc eti(C T Z + BZ T AZ) 

where C = N T C^i t2 )], B = diag(6i j i, 62,2) and A = N T AN . Since Z is orthogonal, we can 
parameterize it as 

sin</> — s cos (ft J 

for some (ft G (0, 2ir) and s = ±1. The second column Z^ 2 ] of Z is a linear function of the first column 
-^[,1], and the uniform density on the circle is constant in </>, so the joint density of ((ft, s) is simply 
p(Z(<ft, s)). Sampling from this distribution can be accomplished by first sampling (ft G (0, 2ir) from 
a density proportional to p(Z(<ft,—l)) + p(Z((ft, +1)), and then sampling s conditional on 0. To 
summarize, the Gibbs sampling scheme for the case m = R is as follows: 

Given X* 5 ) = X, perform steps 1-5 for each pair (r\,r 2 ) C {1, . . . , R} in random order: 

1. let iV be the null space of X^_^ ri ^y, 

2. compute C = iY T C[ ( riif , 2 )], B = diag(6 rijri , 6 r2 , r2 ) and A = N T AN; 

3. sample G (0, 27r) from the density proportional to p(Z((ft, —1)) + p(Z((ft, +1)); 

4. sample s G {—1, +1} with probabilities proportional to {p(Z((ft, —1)), p(Z(cft, +1))}; 

5. set Z = Z((ft,s) and X^ ri>r2 ^ = NZ. 

Set X( s+1 ) = X. 
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Figure 1: Descriptive plots of the protein interaction network. The first panel shows the complete 
dataset. The second panel gives a histogram of the degree distribution. 

4 Example: Eigenmodel estimation for network data 

In this section we use the model for network data described in the Introduction to analyze a 



symmetric binary matrix of protein-protein interaction data, originally described in Butland et al 



|2005 . For these data, yij = 1 if proteins i and j bind together and yij = otherwise. The data 



consist of pairwise measurements among m = 270 essential proteins of E. Coli. The interaction 
rate is y = 0.02, with most nodes (53%) having only 1 or 2 links. Nevertheless, the large connected 
component of the graph consists of 230 of the 270 nodes, as shown in the first panel of Figure [T] 

As described in the introduction, our model for these data is essentially a latent factor model 
with a probit link: 

Vi,3 = $(c,oo)( z i,j) 

z it j = ujAuj + e it j 
Z = UkU T + E, 

With U E Vn,mi this model can be thought of as an i?-dimensional latent eigenvalue decomposition 
for the graph Y. Hoff |2005 discusses parameter estimation for a version of this model, and Hoff 



and Ward| (2004 use such a model to describe networks of international relations. However, the 



approaches in these papers use normal distributions for the latent factors, leading to some identi- 
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fiability issues. For example, the magnitudes of the factors are confounded with the eigenvalues, 
and attempts to fix the scale of the Uj's lead to complicated constraints among their multivariate 
means, variances and covariances. 

A cleaner approach to modeling such data is to estimate {u±, . . . , u m } as being the rows of an 
m x R orthonormal matrix U. The probability density of a symmetric matrix Z with mean UAU T 
and off-diagonal unit variance is 

p(Z\U,L) oc etr[-(Z -UAU T ) T {Z -UAU T )/A] 

= etr(-Z T Z/4)etr(Z T [/A£/ T /2)etr(-A 2 /4). 

We call this model a latent eigenmodel, as the parameters U and A are the eigenvectors and 
values of the mean matrix of Z. The —1/4 has replaced the usual —1/2 because Z is symmetric. 
Additionally, the diagonal elements of Z have variance 2, but do not correspond to any observed 
data as the diagonal of Y is undefined. These diagonal elements are integrated over in the Markov 
chain Monte Carlo estimation scheme described below. 

Using a uniform prior distribution on U and independent normal(0, r 2 ) prior distributions for 
the elements of A gives 

R 

p(A\Z, U) = Yl dnorm(A r : mean = T 2 UjZU r /{2 + r 2 ), var = 2r 2 /(2 + r 2 )) 

r=l 

p(U\Z,A) oc eti(Z T UAU T /2) 
= etr{AU T ZU/2) 

= dBMF(C7 : A = Z/2, B = A, C = 0) 

where "dnorm" and "dBMF" denote the normal and BMF densities with the corresponding param- 
eters. Approximate posterior inference for U and A can be obtained via iterative Gibbs sampling of 
{U, A, Z, c} from their full conditional distributions given the data Y. One iteration of the sampling 
scheme consists of the following: 

Given {U, A, Z, c}^ = {U, A, Z, c} 

1. sample the columns of U from their full conditional distributions under BMF(Z/2, A, 0); 

2. sample the elements of A from their normal conditional distributions given above; 

3. sample the elements of Z from normal densities with mean UAU T but constrained to be 
above or below c depending on Y; 

4. sample c from a constrained normal distribution. 
Set {U,A,Z,c}( s+ V = {U,A,Z,c}. 
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Figure 2: Samples of A from the two independent Markov chains. 



A natural choice of the prior parameter r 2 is m, as this is roughly the variance of the eigenvalues 
of an m x m matrix of independent standard normal noise. 

There are several reasons for fitting a statistical model to these data: First of all, the undefined 
diagonal {y^j} precludes a standard eigenvalue decomposition of the original data. Second, even if 
the diagonal could be reasonably defined, the data are binary and so a decomposition on this raw 
data scale may be inappropriate. Additionally, a statistical model provides measures of uncertainty 
and predictive probabilities. The latter can be particularly useful in terms of outlier analysis: 
pairs for which j/j j = but Pr(yj = 1) is large might indicate a "missing link" and could warrant 
further investigation. 

A three-dimensional eigenmodel was fit to the protein interaction data using the Gibbs sampling 
scheme described above. Specifically, two independent Gibbs samplers of length 110,000 were 
constructed, one initiated with random starting values and the other with values obtained from the 
eigenvalue decomposition of a rank-based transformation of Y, in which the ranks of tied values 
were randomly assigned. Both chains converged to the same region of the parameter space after 
a few thousand iterations. A plot of the sequences of sampled eigenvalues from each of the chains 
is given in Figure [2] indicating one negative and two positive eigenvalues. For posterior analysis 
the first 10,000 iterations of each chain were dropped and only every 100th iteration was retained, 
leaving 1000 sample values from each of the two chains. From these samples we can calculate the 
posterior mean value of UAU T , which is not exactly a rank 3 matrix but is very close - its first 
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three eigenvectors accounted for 99.95 percent of the sum of squares of the posterior mean. The 
eigenvectors corresponding to the largest eigenvalues of this mean matrix can be reasonably thought 
of as a posterior point estimate of U. 

The eigenvectors corresponding to the two positive eigenvalues are plotted in the first panel of 
Figure [3J along with links between interacting protein pairs. Proteins with large values of uf^ + uf 2 
are plotted using their names. For positive eigenvalues, the interpretation of the parameters is that 
(ui 7 i , «i,2) and (1^1,1^,2) being in the same direction contributes to the tendency for there to be 
an interaction between nodes i and j. Additionally, in this model a network "hub" having many 
connections is modeled as having a large value of uf ^ + uf 2 an d makes most of its connections to 
proteins having factors of smaller magnitude but in the same direction. 

The second panel of Figure [3] displays a different aspect of the protein network. The plot 
identifies two groups of proteins having large positive and large negative values of {^,3} respectively. 
Members of each group are similar in the sense that they primarily interact with members of 
the opposite group but not with each other. The model captures this pattern with a negative 
eigenvalue A3, so that 1^3, Uj$ being large and of opposite sign is associated with a high probability 
of interaction between i and j. In this way, the latent eigenmodel is able to represent subnetworks 
that resemble bipartite graphs. 

For a detailed biological interpretation of the different hubs and clusters of the network, the 
reader is referred to Butland et al. (2005 . 



5 Discussion 



Distributions over the Stiefel manifold play an important role in spatial statistics, multivariate anal- 
ysis and random matrix theory. An important family of probability distributions over this manifold 
is the matrix Bingham- von Mises-Fisher family, which generalizes the vector- and matrix- valued von 
Mises-Fisher and Bingham distributions. This article has developed a rejection sampling scheme for 
the matrix MF distribution and a Gibbs sampling scheme for the matrix BMF distribution, thereby 
providing a useful tool for studying these complicated multivariate probability distributions. 

Additionally, it has been shown that members of the BMF family of distributions arise as 
conditional posterior distributions in Gaussian and probit models for multivariate data. Likelihood- 
based approaches to multivariate modeling may be necessary when the data are ordinal, missing 
or otherwise non-standard, and being able to sample from the BMF family allows for parameter 
estimation in these situations. 

R-functions to implement the sampling schemes outlined in this article are available at my 
website: http : //www. st at .Washington. edu/hoff/, 
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Figure 3: Plots of the latent eigenvectors. In the first panel, nodes with large values of \ui^\ are 
plotted with white circles. 
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